Air pollution is one of the most serious environmental problems in the world. Our goal is to test, for each sensor, whether its air pollution index (API) differs between years, and so to describe how air quality has changed in different parts of Sofia.
There are nearly 500 sensors in Sofia. They record P2, the PM2.5 reading that we use as the air pollution index, almost every second.
We are interested in whether air quality at each sensor location got worse or better over two years. We therefore run a two-sample t-test for each sensor, comparing the average PM2.5 of 2017.08-2018.07 with that of 2018.08-2019.07. From the p-values, we conclude that all but 5 sensors show a significant change. PM2.5 tends to increase in the central area and to decrease in the peripheral area.
Our data comes from Kaggle: https://www.kaggle.com/hmavrodiev/sofia-air-quality-dataset. It contains 13 GB of data, stored in 50 separate CSV files. Each CSV file covers one month and contains fields such as sensor_id, observation time, longitude, and latitude. Half of these CSV files contain P2, which is the target response.
Part of the results of T-test
For each monthly CSV file, we first split out each sensor's rows into a file named sensorid_year_month.csv. Because this step was computationally heavy, we used CHTC to parallelize it. We then wrote a shell script to merge those files by year for each sensor. In the end we had two files per sensor, containing the air quality data for 2017.8-2018.7 and 2018.8-2019.7. Some sensors were missing data for a whole year, so we dropped them. After processing, 351 sensors remain.
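The splitting step can be sketched in Python as follows. This is a minimal illustration, not the script we ran on CHTC: the function names `split_by_sensor` and `period_label` are hypothetical, the output file name pattern is an assumption based on the naming described above, and the sketch holds one monthly file in memory, which may not suit the largest files.

```python
import csv
from collections import defaultdict
from pathlib import Path


def period_label(year, month):
    """Map a calendar month to the August-to-July window it falls in."""
    start = year if month >= 8 else year - 1
    return f"{start}.08-{start + 1}.07"


def split_by_sensor(monthly_csv, out_dir, year, month):
    """Write one <sensor_id>_<year>_<month>.csv file per sensor in monthly_csv.

    Assumes the file has a header row with a `sensor_id` column.
    Returns the list of paths written, sorted by sensor id (as text).
    """
    rows_by_sensor = defaultdict(list)
    with Path(monthly_csv).open(newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        for row in reader:
            rows_by_sensor[row["sensor_id"]].append(row)

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for sensor_id, rows in sorted(rows_by_sensor.items()):
        out_path = out_dir / f"{sensor_id}_{year}_{month:02d}.csv"
        with out_path.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
        written.append(out_path)
    return written
```

Because each monthly file is handled independently, one job per file can run in parallel, and `period_label` tells the later merge step which of the two yearly files a month belongs to.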
Part of the results of the two-sample t-test for the 351 sensors is shown below.
Part of the results of T-test
As the table shows, every sensor in this table has a p-value far below 0.05, so each shows a significant change between the two years. In fact, all but 5 of the 351 sensors show a significant change.
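For readers who want to reproduce the per-sensor comparison, here is a minimal sketch of a two-sample t-test using only the Python standard library. Two assumptions are made that the text above does not settle: it uses Welch's variant (no pooled variance), and it approximates the p-value with the normal distribution, which is reasonable only because each sensor contributes a very large number of readings. The function name `welch_t_test` is hypothetical. In practice, `scipy.stats.ttest_ind` with `equal_var=False` would give an exact t-based p-value.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_t_test(sample_a, sample_b):
    """Return (t statistic, approximate two-sided p-value) for two samples.

    A positive t means the mean of sample_b is larger than that of sample_a.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    # Standard error of the difference in means, without pooling variances.
    se = sqrt(variance(sample_a) / n_a + variance(sample_b) / n_b)
    t = (mean(sample_b) - mean(sample_a)) / se
    # Normal approximation to the t distribution (assumes large samples).
    p = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
    return t, p
```

Called as `welch_t_test(pm25_first_year, pm25_second_year)` for one sensor, a p-value below 0.05 marks the change as significant, and the sign of `t` gives its direction.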
We had also hoped to explore whether humidity, pressure, and temperature have a significant influence on the change in PM2.5 by fitting a linear regression. Unfortunately, in our current data set the sensors that record these three variables are all different from the sensors that record PM2.5, so we could not explore the relationship further.
The heatmap shows how the sensors are distributed across Sofia. Most of them are in the central area.
According to our tests, only 5 sensors show no significant difference between the two years; they are shown in green. Blue dots mark sensors where PM2.5 decreased by a smaller amount, and navy dots mark sensors where it decreased by a larger amount. Orange and red dots mark sensors where PM2.5 increased by a smaller and a larger amount, respectively. Roughly speaking, PM2.5 tends to increase in the central area and to decrease in the peripheral area.
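The color scheme above amounts to a small decision rule, sketched here in Python. The function name `change_color` and the `large_change` cutoff separating "smaller" from "larger" changes are hypothetical; the value 5.0 is a placeholder for illustration, not the threshold used for the map.

```python
def change_color(mean_diff, p_value, alpha=0.05, large_change=5.0):
    """Map a sensor's year-over-year PM2.5 change to a map color.

    mean_diff is second-year mean minus first-year mean.
    large_change is an assumed cutoff; the real one is not stated here.
    """
    if p_value >= alpha:
        return "green"  # no significant change
    if mean_diff < 0:
        # Significant decrease: navy for larger drops, blue for smaller ones.
        return "navy" if abs(mean_diff) >= large_change else "blue"
    # Significant increase: red for larger rises, orange for smaller ones.
    return "red" if mean_diff >= large_change else "orange"
```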